How to create a binary variable that identify whether another variable has duplicates in the dataset using Stata?*

smith Jason

Join Date: Sep 2020

Posts: 380
#1

How to create a binary variable that identify whether another variable has duplicates in the dataset using Stata?*

25 Feb 2022, 15:21

The dataset is laid out in longitudinal format where the patient id (pid) is the first column, and the second column "PE"is if
the patient received duplicated physical check-up during their hospital stay. I want to create a new variable "RE" that can achieve
the goal like this: if PE has duplicates, then RE=1; If not, RE=0.

* Example generated by -dataex-.
clear
input byte (pid PE)
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 3
3 1
3 1
3 2
3 3
4 1
4 1
4 2
4 3
end
Thank you for your help!
Tags: None

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

25 Feb 2022, 15:32

Code:

bysort pid (PE) : gen RE = sum(PE==PE[_n-1])
bysort pid (PE) : replace RE = RE[_N] > 0

Comment

smith Jason

Join Date: Sep 2020

Posts: 380
#3

25 Feb 2022, 15:38

Originally posted by Bjarte Aagnes View Post

Code:

bysort pid (PE) : gen RE = sum(PE==PE[_n-1]) bysort pid (PE) : replace RE = RE[_N] > 0

Thank you! Could you please explain your code?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36059
#4

26 Feb 2022, 04:05

The code counts duplicates (within patients) of any value as after sorting a duplicate will show up as being equal to the previous value. If your example, the duplicates are in following observations, but the code doesn't assume that.

The code then reduces that to 1 for duplicates exist and 0 otherwise.
1 like
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 786
#5

26 Feb 2022, 04:47

The concepts are explained in [D] by -- Repeat Stata command on subsets of the data, and Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

Adding a longer more detailed explanation, hopefully not introducing errors...

First,
the expression gen RE = sum(PE==PE[_n-1]) is calculated for each by group pid on data sorted within by group by PE.

Thus, within each pid with data sorted by PE:
The system variables _n and _N represent the number of the current observation (_n) and the total number of observations (_N) observations in the current by group.

The relational expression PE==PE[_n-1] will return 0 if false and 1 if true. Here 0 if the current observation of PE is different from the previous observations of PE. The expression will be true and return the value 1 for any record where the current observation of PE is the same as the previous observation.

The function sum() return the running/cumulative sum over observations of the expression (PE==PE[_n-1]) . Thus, for the last observation within by group pid the running sum will be 0 if no duplicates where found.
Second,
bysort pid (PE) : replace RE = RE[_N] > 0 replace RE within by group with 1 or 0 depending on the expression RE[_N] > 0 which will be false returning 0 if no duplicates where found.
The following code may help illustrate some of the above:

Code:

bysort pid (PE) : gen n = _n // current obs within by group bysort pid (PE) : gen N = _N // last obs within by group bysort pid (PE) : gen isN = (_n==_N) // true of last obs within bygroup bysort pid (PE) : gen byte PEeqlagPE = ( PE==PE[_n-1] ) // true of PE equals previous PE within bygroup bysort pid (PE) : gen byte sumPEeqlagPE = sum(PEeqlagPE) // cumulative/running sum of PEeqlagPE within bygroup bysort pid (PE) : gen pidhasdups = sumPEeqlagPE[_N] > 0 // true if sumPEeqlagPE[_N] > 0 within bygroup list pid PE n-pidhasdups , sepby(pid) compress abbrev(32)
1 like
Comment
smith Jason

Join Date: Sep 2020

Posts: 380
#6

27 Feb 2022, 23:05

Originally posted by Bjarte Aagnes View Post

The concepts are explained in [D] by -- Repeat Stata command on subsets of the data, and Cox, N. J. 2002. Speaking Stata: How to move step by: step. Stata Journal 2: 86–102.

Adding a longer more detailed explanation, hopefully not introducing errors...

First,
the expression gen RE = sum(PE==PE[_n-1]) is calculated for each by group pid on data sorted within by group by PE.

Thus, within each pid with data sorted by PE:
The system variables _n and _N represent the number of the current observation (_n) and the total number of observations (_N) observations in the current by group.

The relational expression PE==PE[_n-1] will return 0 if false and 1 if true. Here 0 if the current observation of PE is different from the previous observations of PE. The expression will be true and return the value 1 for any record where the current observation of PE is the same as the previous observation.

The function sum() return the running/cumulative sum over observations of the expression (PE==PE[_n-1]) . Thus, for the last observation within by group pid the running sum will be 0 if no duplicates where found.
Second,
bysort pid (PE) : replace RE = RE[_N] > 0 replace RE within by group with 1 or 0 depending on the expression RE[_N] > 0 which will be false returning 0 if no duplicates where found.
The following code may help illustrate some of the above:

Code:

bysort pid (PE) : gen n = _n // current obs within by group bysort pid (PE) : gen N = _N // last obs within by group bysort pid (PE) : gen isN = (_n==_N) // true of last obs within bygroup bysort pid (PE) : gen byte PEeqlagPE = ( PE==PE[_n-1] ) // true of PE equals previous PE within bygroup bysort pid (PE) : gen byte sumPEeqlagPE = sum(PEeqlagPE) // cumulative/running sum of PEeqlagPE within bygroup bysort pid (PE) : gen pidhasdups = sumPEeqlagPE[_N] > 0 // true if sumPEeqlagPE[_N] > 0 within bygroup list pid PE n-pidhasdups , sepby(pid) compress abbrev(32)

Thank you!
Comment

Announcement

How to create a binary variable that identify whether another variable has duplicates in the dataset using Stata?*

Comment

Comment

Comment

Comment

Comment